An increasingly popular trend in high performance computing is the use
of symmetric multiprocessing (SMP) rather than the older paradigm of
massively parallel processing (MPP). MPI codes that ran and scaled well
on MPP machines can often be run on an SMP machine using the vendor's
version of MPI. However, this approach may not make optimal use of the
(expensive) SMP hardware. More significantly, there are machines like
Blue Horizon, an IBM SP with 8-way SMP nodes at the San Diego
Supercomputer Center, that can only support 4 MPI processes per node
(with the current switch). On such a machine it is imperative to be able
to use OpenMP parallelism on the node, and MPI between nodes. We
describe the challenges of converting the MILC MPI code to use a second
level of OpenMP parallelism, and present benchmarks on IBM and Sun
computers.



1. OpenMP AND MPI

Open Multiprocessing (OpenMP or OMP [1]) and Message Passing Interface
(MPI) are two strategies for using multiple processors on a single
problem. The key difference between them is that in MPI each node has
its own memory and nodes communicate with each other when needed,
whereas in OpenMP the memory is shared between threads.

Here is an example. Suppose we have a two-dimensional lattice with 4
sites in each direction, and we are using four nodes or threads, as
shown below.

Suppose node/thread 1 corresponds to sites 1, 2, 5 and 6. In MPI, node 1
has information only about sites on that node, namely 1, 2, 5 and 6. If
it needs information about other sites, for example about sites 3 or 7,
which are nearest neighbors of sites 2 and 6 respectively, it has to use
communication routines.

Contrast this with OpenMP, where all threads 
have access to data for all sites, but thread 1 does 
computations only for sites 1, 2, 5 and 6. 
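
The ownership rule in this example can be written down in a few lines of C. This is only a sketch under stated assumptions: the sites are taken to be numbered 1 to 16 row by row, each node is taken to hold a 2x2 block (which reproduces sites 1, 2, 5 and 6 for node 1), and the numbering of the other three nodes as well as the names owner_of_site and needs_message are hypothetical.

```c
#include <assert.h>

#define NX 4 /* sites per direction in the example lattice */

/* Node (1..4) that owns a site (1..16). Assumes row-by-row site
   numbering and one 2x2 block of sites per node. */
int owner_of_site(int site) {
    int x, y;
    assert(site >= 1 && site <= NX * NX);
    x = (site - 1) % NX; /* column, 0..3 */
    y = (site - 1) / NX; /* row, 0..3 */
    return (y / 2) * 2 + (x / 2) + 1;
}

/* Nonzero if an MPI code needs a message for site a to read site b.
   In OpenMP the answer is always "no": every thread sees all sites. */
int needs_message(int a, int b) {
    return owner_of_site(a) != owner_of_site(b);
}
```

With this layout, needs_message(2, 3) and needs_message(6, 7) are both nonzero, matching the two neighbor pairs mentioned above, while needs_message(1, 2) is zero.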

Table 1 summarizes the differences between the two strategies. A trend
towards shared memory parallel machines or clusters of Symmetric
Multiprocessing (SMP) nodes rather than the older paradigm of Massively
Parallel Processing (MPP) machines makes a study of OpenMP parallelism
timely. OpenMP was designed to exploit certain characteristics of
shared-memory architectures. The ability to directly access memory
throughout the system (with minimum latency and no explicit address
mapping), combined with very fast shared memory locks, makes
shared-memory architectures best suited for supporting OpenMP. The
advantage of OpenMP is that it is easier to program. Unlike MPI, one
does not have to worry about passing messages between nodes. In this
paper, we study how OpenMP performs relative to MPI, and whether
combining the two strategies gives better performance.

2. OpenMP DETAILS 

In this section, we give some details about how a C code that works on a
single processor is changed to work with multiple threads. The number of
threads is determined by an environment variable, OMP_NUM_THREADS. The
code is executed serially, on a single thread, until a parallel
construct is encountered; that construct is executed on multiple
threads, and then serial execution resumes. To define a parallel
construct, lines beginning with #pragma omp are added to the code.

Table 1. Comparison between OpenMP and MPI



Such pragmas are ignored by an ordinary C compiler, so the code may also
be run as an ordinary serial code. They are, however, interpreted by an
OpenMP compiler to identify parallel regions. Several constructs can be
made to execute in parallel; here is an example for a "for" construct.

#pragma omp parallel for
for(i=0;i<N;i++){
    my_job(i);
}

This will run the function my_job in parallel on different threads. Note
that though the memory is shared, each thread must have a private copy
of some variables, like i in the above example. Loop variables are made
private by default, but other such variables have to be declared
"private". Some variables may need to be summed over all the sites. This
is accomplished with a reduction clause. The syntax is as follows:

j=0;
#pragma omp parallel for reduction(+:j)
for(i=0;i<N;i++){
    j+=my_function(i);
}

This is equivalent to

$$ j = \sum_{i=0}^{N-1} \mathrm{my\_function}(i). $$

Note that the sum is performed over all threads 
though each thread works only on part of the total 
number of iterations. 

Identifying private and reduction variables 
is necessary for getting correct results. 
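
A private variable and a reduction variable can appear in the same loop. The sketch below is illustrative only: sum_over_sites, the scratch variable tmp and the body of my_function (the square of the index) are assumptions made for the example and are not taken from the MILC code. Because it uses only pragmas, it also compiles and gives the same result as serial code when OpenMP is not enabled.

```c
#include <assert.h>

/* Hypothetical per-iteration work: the square of the index. */
static double my_function(int i) {
    return (double)i * (double)i;
}

/* Sum my_function(i) for i = 0 .. n-1 using a parallel for loop. */
double sum_over_sites(int n) {
    int i;          /* loop variable: private by default */
    double tmp;     /* scratch value: must be declared private */
    double j = 0.0; /* accumulator: must be declared as a reduction */
    assert(n >= 0);
#pragma omp parallel for private(tmp) reduction(+ : j)
    for (i = 0; i < n; i++) {
        tmp = my_function(i); /* each thread writes its own copy of tmp */
        j += tmp;             /* per-thread partial sums are combined at the end */
    }
    return j;
}
```

Without private(tmp), all threads would write to the same tmp; without the reduction clause, the updates to j would race. In both cases the result could differ from the serial one.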

3. MILC CODE AND CHANGES 

The MILC [2] code is a set of publicly available codes developed by the
MIMD Lattice Computation (MILC) collaboration for doing QCD simulations.
This code has been run on a variety of parallel computers, using MPI,
for many physics projects. The files are organized in different
directories: the libraries directory contains low-level routines like
matrix multiplication, and the generic directory contains frequently
needed but somewhat higher-level routines, including the updating and
inversion routines. Then there are various application directories. For
this project, we concentrated only on the conjugate gradient inverter,
file d_congrad5.c in the generic_ks directory in version 6 of the MILC
code. The code uses a macro FORALLSITES defined as

#define FORALLSITES(i,s) \
    for(i=0,s=lattice;i<sites_on_node;i++,s++)

where lattice is an array of sites, site is a structure containing
variables defined at each lattice point, and sites_on_node is the number
of lattice points on a given node. We needed to redefine the macro
FORALLSITES because the OpenMP compiler we used could not deal with two
variables (i and s) in a parallel for statement. Here is the macro
redefinition.

#define FORALLSITES(i,s) \
    for(i=0;i<sites_on_node;i++){ \
        s=&(lattice[i]);

We used another macro, END_LOOP, which is just defined to be a closing
brace } to match the opening brace in the above macro.
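
Put together, a loop over sites with the redefined macros might look like the following. This is a minimal sketch, not the MILC source: the one-field site structure, the helper setup_lattice and the function sum_phi are hypothetical stand-ins chosen to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the MILC site structure, which holds
   many more fields per lattice point. */
typedef struct {
    double phi;
} site;

static site *lattice;     /* array of sites on this node */
static int sites_on_node; /* number of sites on this node */

/* Redefined macro: a single loop variable i, with s recomputed from i
   inside the loop body so that the loop is a plain counted for loop. */
#define FORALLSITES(i, s) \
    for (i = 0; i < sites_on_node; i++) { \
        s = &(lattice[i]);
#define END_LOOP }

/* Point the globals at caller-owned storage; set phi to the site index. */
void setup_lattice(site *storage, int n) {
    int i;
    assert(storage != NULL && n >= 0);
    lattice = storage;
    sites_on_node = n;
    for (i = 0; i < n; i++) {
        lattice[i].phi = (double)i;
    }
}

/* Sum phi over all sites; s is private and total is a reduction. */
double sum_phi(void) {
    int i;
    site *s;
    double total = 0.0;
#pragma omp parallel for private(s) reduction(+ : total)
    FORALLSITES(i, s)
        total += s->phi;
    END_LOOP
    return total;
}
```

The pragma applies to the for statement that the macro expands to, so existing loop bodies need only the pragma line in front and END_LOOP at the end.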

4. COMPILING AND RUNNING 

We used the KAP/Pro toolset [3] for this project. It includes the
following:

• guidec: OpenMP compiler for C.

• guideview: OpenMP parallel performance visualization tool. It gives
details of program execution, in particular, time spent in serial and
parallel execution, imbalance in different regions of the code, etc.

• assurec: Compiler to be used with a debugger which works by comparing
single-thread and multiple-thread executions.

• assureview: OpenMP programming correctness tool for viewing details of
errors or conflicts which occur if different threads try to read/write
the same variables at the same time.

To add OpenMP parallelism to the MILC code, the following steps were
required. First, we had to redefine the macro as explained above. Then
we added the parallel for pragmas, specifying private and reduction
variables. For example, in the FORALLSITES loop, s was made private. We
changed cc to guidec in our makefiles, and we had to modify those
compiler options that guidec did not recognize. Adding --backend before
a compiler option forces guidec to use it as a cc compiler option. Then,
we ran assurec and assureview to locate and remove conflicts. Finally,
we ran the executable with different numbers of threads and verified
that the output agreed with the MPI output.

Even after one has a working OpenMP code, there are some issues to
consider when comparing its performance with that of MPI. Some
performance problems are OpenMP issues, while others are not. If
single-thread OpenMP performance does not match that of a single node
under MPI, that may indicate a culprit other than OpenMP. For example,
thread-safe compilation requires the -mt switch on Sun. If using this
switch on the original serial code decreases performance substantially,
then the performance issue lies with the Sun compiler and its runtime
libraries, not with OpenMP. If the code uses many malloc/free pairs,
then thread-safe memory allocation is likely the culprit. Again, this is
not an OpenMP issue, but an issue with the quality of the vendor's
thread-safe compiler/runtime implementation. We verified that the -mt
option on the serial version on Sun did not affect the performance
significantly. Except for the case where we combine OpenMP and MPI, no
malloc/free statements are used in the region of the code where the
performance is evaluated. Thus, to the best of our knowledge, this is a
fair comparison between OpenMP and MPI performance.

5. RESULTS FOR SUN E10000 AND BLUE HORIZON

We first ran the modified codes on a Sun E10000 at Indiana University.
The details of the architecture of this computer can be found online.
Benchmarks were done for various lattice sizes and numbers of threads.
As the number of threads increased, the lattice dimensions were
increased to keep the volume per thread constant at $L^4$. The number of
threads N was increased from 1 to 16 by factors of 2. For example, for a
given L, the lattice size for 2 threads is $L^3 \times 2L$ and for 16
threads it is $(2L)^4$. L was increased from 4 to 14 in steps of 2.
Reported in Fig. 1 is the performance of the Kogut-Susskind quark
conjugate gradient routine in Megaflop/s per CPU for both OMP and MPI.
For smaller numbers of threads/nodes, the OMP rates are quite comparable
to MPI. They lag behind for larger numbers of threads. There are two
factors involved here: the overhead for setting up threads and the use
of the cache. For small lattice sizes, since there is only a small
number of computations to be performed, the former degrades the
performance, but if a significant portion of the problem can fit in the
cache, the execution is sped up. On the other hand, for larger lattices
the thread initialization overhead is a much smaller fraction of the
total computation time, but the problem size is too big to fit into the
cache. We see that OMP has a "sweet spot" at size 6, much as the MPI
performance peaks at size 8. Since we keep the load per thread constant,
for the same size L the performance monotonically decreases in most
cases as we increase the number of threads.
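
The lattice sizes used in these scaling runs follow from the rule that the volume per thread stays at $L^4$. A minimal sketch of that arithmetic is given below. The text fixes only the 2-thread and 16-thread shapes; doubling one direction per factor of 2 for the intermediate cases, the order in which directions are doubled, and the name lattice_dims are assumptions.

```c
#include <assert.h>

/* Fill dims[0..3] with the lattice dimensions for a run on nthreads
   threads (1, 2, 4, 8 or 16), doubling one direction per factor of 2
   so that volume / nthreads == L^4. Returns the total volume, or -1
   if nthreads is not one of the supported values. */
long lattice_dims(int L, int nthreads, int dims[4]) {
    int d, doubled = 0;
    long volume = 1;
    assert(L > 0 && nthreads > 0);
    for (d = 0; d < 4; d++) {
        dims[d] = L;
    }
    while (nthreads > 1) {
        if (nthreads % 2 != 0 || doubled == 4) {
            return -1; /* not a power of 2, or more than 16 threads */
        }
        dims[3 - doubled] *= 2; /* double the last undoubled direction */
        doubled++;
        nthreads /= 2;
    }
    for (d = 0; d < 4; d++) {
        volume *= dims[d];
    }
    return volume;
}
```

For L = 4 this gives a 4 x 4 x 4 x 8 lattice (512 sites) on 2 threads and an 8 x 8 x 8 x 8 lattice (4096 sites) on 16 threads, that is, 256 sites per thread in both cases.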

Next we benchmarked the code on Blue Horizon. This IBM SP machine at the
San Diego Supercomputer Center has 8-way SMP nodes but, with the current
switch, can support only 4 MPI processes per node. Figure 2 contains the
preliminary results from Blue Horizon. These results are qualitatively
similar to the E10000 results.

Figure 1. Comparison between OpenMP and MPI performance on Sun E10000.
The open symbols correspond to OMP and the filled symbols to MPI.

Figure 2. Comparison between OpenMP and MPI performance on Blue Horizon.
The open symbols correspond to OMP and the filled symbols to MPI.

6. COMBINING OMP AND MPI

A hybrid approach combining OpenMP parallelism within MPI processes may
offer better performance than either individual approach. We tried
different combinations of threads and MPI processes on Blue Horizon. The
hybrid approach fared better in some cases. Figure 3 shows the results
for a total of eight processors. It can be seen again that the MPI
performance peaks at size 8 (the left-most bars in Fig. 3) and the OMP
performance at size 6 (the right-most bars). The combination of 2
threads and 4 nodes works best for smaller sizes. The processors on Blue
Horizon were upgraded after these runs. These calculations should be
repeated and the study extended to a larger number of CPUs.
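
For the eight-processor hybrid runs on Blue Horizon, the ways of dividing the processors between OpenMP threads and MPI processes can be enumerated with a small helper. This is only an illustrative sketch: the function name hybrid_splits is hypothetical, and the text does not state that every split listed here was benchmarked.

```c
#include <assert.h>

/* List every way to write cpus = threads-per-process x MPI processes.
   Stores up to max_splits pairs in threads[] and procs[] and returns
   the number of pairs stored. */
int hybrid_splits(int cpus, int threads[], int procs[], int max_splits) {
    int t, n = 0;
    assert(cpus > 0 && max_splits >= 0);
    for (t = 1; t <= cpus && n < max_splits; t++) {
        if (cpus % t == 0) {
            threads[n] = t;      /* OpenMP threads per MPI process */
            procs[n] = cpus / t; /* number of MPI processes */
            n++;
        }
    }
    return n;
}
```

For cpus = 8 the splits are 1 x 8 (pure MPI), 2 x 4, 4 x 2 and 8 x 1 (pure OpenMP). On a single Blue Horizon node only the splits with at most 4 MPI processes use all eight processors within the limit imposed by the current switch.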



Figure 3. Combining OpenMP and MPI on Blue Horizon.

7. CONCLUSION

On both computers studied, OpenMP performance was very similar to MPI
performance for a small number of threads, but for smaller lattice sizes
it deteriorated much faster as the number of threads increased. Thus,
OpenMP may be a viable option for someone writing a code to be used with
a modest number of processors on SMP machines. The MILC Collaboration,
however, already has a working MPI code that scales well on many
machines. For almost all the combinations of problem sizes and numbers
of CPUs studied in this paper, MPI is at least as good as OpenMP, if not
better. The only case where we get a considerable improvement over MPI
is when we combine OpenMP and MPI on Blue Horizon for L = 4 and 6. Not
only does the hybrid approach give the best performance on a single SMP
node, it should allow us to run multi-node jobs using all eight
processors on each node rather than the limit of four with the current
switch. We have added OpenMP parallelism to the MILC code only for the
conjugate gradient inverter for this test project. It will require
considerably more effort to modify the whole code to run on multiple
OpenMP threads.